On October 7, 2023, Hamas and other Palestinian armed groups orchestrated a deadly attack on Israel. The horrific attack killed 1,200 people, with over 200 hostages seized and over 100 still unaccounted for. Israeli forces began airstrikes and ground operations in response.
- The ongoing conflict has devastated the civilian population of Gaza. Seventy-five percent of the population has been displaced, most of them multiple times, and the entire population is in need of humanitarian assistance. Bombardment and blockade have led to catastrophic humanitarian suffering for more than 2 million Palestinians, half of them children, who are now without clean water, food, and vital medical services.
- Since the beginning of the war on October 7, 2023, Airwars has monitored open-source reports of civilian harm incidents in Gaza. An incident is defined as an explosive weapon or ground battle operation that produced civilian casualties or harm. Victims are assumed to be civilians unless there is information establishing their militant status. Data are derived from tweets and Facebook posts translated from Arabic to English; when a victim’s name is provided, that person is counted as a casualty. Names are corroborated against the Hamas Ministry of Health (MoH) lists when those are released. All data are presented on the Airwars website, and each incident has its own web page.
Aim of Study
The ongoing Gaza-Israel conflict is likely to persist in 2025, threatening to destabilize the region while affecting the well-being of children and women. The strategic use of open-source data can empower policymakers to mitigate humanitarian challenges in the region. Gaza’s death toll is estimated at over 43,000, with children and women representing a sizable proportion, as reported by the Hamas Ministry of Health (MoH), yet this figure cannot be corroborated by outside organizations. Airwars, an NGO, has scraped social media platforms since October 2023 for posts in which users share details about the deaths of friends, family, and neighbors. We use the Airwars Gaza conflict tracker to ask: to what extent can open-source data be used to identify patterns in the targeting of civilian infrastructure and to approximate official casualty estimates?
Our methodology consists of scraping incident reports from the Airwars website that contain information about civilian casualties. From these data we derive a casualty rate for children and women and explore how closely it approximates official MoH estimates.
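As a minimal sketch of what such a casualty rate could look like, the snippet below computes the share of children and women among deaths with a reported demographic group. The counts are illustrative values, not results of the study, and the choice of denominator (all deaths with a known group) and the treatment of missing breakdowns as zero are assumptions made for this example.

```r
# Illustrative counts only (not study results): one element per incident.
children_killed <- c(3, 11, 11)
women_killed    <- c(NA, 1, 6)   # NA = no women reported for that incident
men_killed      <- c(2, 6, 60)

# Assumption: a missing breakdown contributes zero deaths for that group.
total_children <- sum(children_killed, na.rm = TRUE)
total_women    <- sum(women_killed, na.rm = TRUE)
total_men      <- sum(men_killed, na.rm = TRUE)

# Share of children and women among all deaths with a known demographic group.
children_women_rate <- (total_children + total_women) /
  (total_children + total_women + total_men)
children_women_rate
# 0.32
```

A different denominator, such as the midpoint of the reported range of civilians killed, would give a different rate, so the definition should be stated alongside any estimate.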
Airwars data also contain summaries describing the incidents in detail. We use Hugging Face models to examine changes in the emotional tone of these summaries and pair them with the children and women casualty rate to give these estimates more context.
Coordinates of incidents are sometimes provided, and we use them to query the OpenStreetMap API for additional information about what these locations were (e.g., schools, mosques, government buildings). This information lets us explore which parts of Gaza’s infrastructure have endured the most destruction, through static map plots or interactive maps that can be filtered by location type.
We enrich the Airwars database by joining the daily MoH casualty counts, which we plot against our casualty estimate to explore the discrepancy. We also join the Armed Conflict Location and Event Data (ACLED), which describe characteristics of Israel’s operations, to check whether the incidents in Airwars correspond with the Israeli operations in Gaza reported in ACLED. This serves as a validation check of our data.
We use the following packages; the project code can be found in our GitHub repository. We also used the text, data.table, and jsonlite packages.
pacman::p_load(lubridate, RSQLite, knitr, DBI, ggthemes, viridis, tidyverse)
We also store all of our data in a SQLite database, which can be found in our GitHub repository.
The scraped data yielded over 800 unique events stored across four tables in a SQLite database. The first table contains incident metadata (e.g., unique id, incident date, web-page URL). The second table stores the specific incident information, such as the number of deaths, the breakdown of deaths (children, adults), the type of attack, and the cause of death. The third table includes incident coordinates and results from OpenStreetMap (OSM), while the last table stores incident summaries and sentiment scores for seven emotional states. These tables relate to each other through the unique incident identification numbers provided by Airwars.
Additionally, we have two more tables that include the data from Palestine Data daily casualties and the ACLED data.
# connect to database
mydb <- dbConnect(
RSQLite::SQLite(),
"~/repos/airwars_scraping_project/database/airwars_db.sqlite")
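Because the tables relate to each other through the incident identifier, they can be combined with an ordinary SQL join. The sketch below illustrates the idea on a throwaway in-memory database instead of the project file; the table and column names follow the ones shown in this report, while the two rows and their ids (`demo001`, `demo002`) are hypothetical.

```r
library(DBI)

# In-memory database so the example does not touch the project file.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical rows; only the column names mirror the project tables.
dbWriteTable(con, "airwars_meta",
             data.frame(Incident_id = c("demo001", "demo002"),
                        Incident_Date = c(19637, 19653)))
dbWriteTable(con, "airwars_coord",
             data.frame(Incident_id = "demo001",
                        type_location = "school"))

# LEFT JOIN keeps incidents that have no coordinates (type_location is NA).
joined <- dbGetQuery(con, "
  SELECT m.Incident_id, m.Incident_Date, c.type_location
  FROM airwars_meta AS m
  LEFT JOIN airwars_coord AS c ON m.Incident_id = c.Incident_id
  ORDER BY m.Incident_id
")

dbDisconnect(con)
joined
```

A left join is used here because, as described later, not every incident has coordinates; an inner join would silently drop those incidents.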
Scraping Airwars Civilian Casualties Incidents
- The image below is an example of the Airwars incident metadata, which are presented as baseball cards. All of this information appears on a single web page, and our workflow starts there. We read the main Airwars page that houses this information and scrape only the Incident Date and Incident ID, which we use to build the incident-specific URLs that we later scrape for content.
- All of the code for scraping and processing these data is in our GitHub repository under code/scrape_process_incidence. The code has been optimized and takes about 20 minutes on a modern laptop (32 GB of RAM is sufficient) with a fast internet connection. We describe the use of a laptop GPU for the sentiment analysis below.
- Here we only explain how we pre-processed the data in preparation for analysis.
- For much of the scraping we used SelectorGadget to obtain the XPath and passed it to the rvest package.
- Metadata table: We scrape the main Airwars website and parse the information we need to build a table containing each incident’s URL (over 800 URLs), as seen in the example below.
tbl(mydb, "airwars_meta") |>
  as_tibble() |>
  slice_sample(n=5) |>
  kable()

| Incident_Date | Incident_id | link |
|---|---|---|
| 19653 | ispt0508 | https://airwars.org/civilian-casualties/ispt0508-october-23-2023/ |
| 19647 | ispt0332a | https://airwars.org/civilian-casualties/ispt0332a-october-17-2023/ |
| 19637 | ispt00013 | https://airwars.org/civilian-casualties/ispt00013-october-7-2023/ |
| 19654 | ispt0559 | https://airwars.org/civilian-casualties/ispt0559-october-24-2023/ |
| 19642 | ispt0162 | https://airwars.org/civilian-casualties/ispt0162-october-12-2023/ |
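The links in the table follow a regular pattern: the incident id, then the month name, day, and year of the incident. A minimal sketch of how such a URL could be assembled is shown below. The helper name `build_incident_url` is hypothetical, and the sketch assumes that `Incident_Date` is stored as the number of days since 1970-01-01, which is consistent with the rows above (19653 corresponds to 2023-10-23).

```r
# Hypothetical helper: build an incident URL from its id and date.
build_incident_url <- function(incident_id, incident_date) {
  # Accepts a Date or a day count since 1970-01-01 (assumed storage format).
  d <- as.Date(incident_date, origin = "1970-01-01")
  # month.name is always English, so the result does not depend on the locale.
  month <- tolower(month.name[as.integer(format(d, "%m"))])
  day   <- as.integer(format(d, "%d"))   # drops any leading zero
  year  <- format(d, "%Y")
  paste0("https://airwars.org/civilian-casualties/",
         tolower(incident_id), "-", month, "-", day, "-", year, "/")
}

url_a <- build_incident_url("ispt0508", 19653)
url_b <- build_incident_url("ispt00013", as.Date("2023-10-07"))
url_a
# "https://airwars.org/civilian-casualties/ispt0508-october-23-2023/"
```

Both results match the corresponding `link` values in the sample above; the actual scraping code in the repository may construct the URLs differently.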
- Each incident contains an assessment section detailing what transpired during the incident, who was known to be involved, and the victims it produced. We use this text later to compute emotion scores.
Assessment table: Using the URLs we built in the metadata table, we loop through them and scrape each page to parse the incident assessments, as seen below.
Sentiment analysis. After trying several text classification models and some question/context models, we settled on j-hartmann/emotion-english-distilroberta-base because it goes beyond a positive/negative evaluation and analyzes text for Ekman’s 6 basic emotions, which are common in psychological work on emotions, plus a neutral class. Moreover, this model lets us examine the emotional tone of these assessments over time.[1]
We get a score for each emotion; the closer a score is to 1, the stronger the association, and the scores for an assessment sum to 1.
The model is trained on a balanced subset from the datasets listed above (2,811 observations per emotion, i.e., nearly 20k observations in total). 80% of this balanced subset is used for training and 20% for evaluation. The evaluation accuracy is 66% (vs. the random-chance baseline of 1/7 = 14%).
Given that we have over 800 assessments, we decided to use the text package[2] because it lets us use a laptop GPU (RTX 4070)[3] to run the model on each incident. This produced large processing gains, reducing text analysis from two hours to 10 minutes.
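assessment <- tbl(mydb, "airwars_assessment") |>
  as_tibble() |>
  slice_sample(n=1)
# show the first 1,000 characters; str_sub() also handles shorter texts,
# where the regex "^.{1000}" would return NA
str_sub(assessment$assessment, 1, 1000) |>
  kable(col.names = "Assessment")

| Assessment |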
|---|
| An alleged Israeli airstrike that hit the residential apartment of the Abu Shamala family in the Al-Nasmawi neighbourhood of Khan Yunis during the evening of October 9th 2023 killed between four and five members of the family, including at least three children, and injured up to seven others., Several local sources reported that multiple members of the Abu Shamala family were killed. Four of those killed have been identified as Fatima Fauzi Abu Shamala (39 years old) and her children – her two daughters, 16 year old Tasnim Ibrahim Abu Shamala and 6 year old Yasmine Ibrahim Abu Shamala, and her son, 14 year old Mahmoud Ibrahim Abu Shamala. One other son, Elias Ibrahim Abu Shamala, was mentioned in a Facebook post by his uncle as being in serious condition while one of the family’s neighbours posted that all four children were killed. Accordingly, Airwars has recorded the number of civilians killed as 4-5, and injured as 0-1., Tasnim’s teacher posted a message memorializing her, and ment |
assessment |>
  select(-assessment) |>
  mutate(across(where(is.double), ~ round(.x, 3))) |>
  kable()

| Incident_id | anger | disgust | fear | joy | neutral | sadness | surprise |
|---|---|---|---|---|---|---|---|
| ispt0062 | 0.107 | 0.051 | 0.037 | 0.003 | 0.065 | 0.714 | 0.024 |
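To make the interpretation of these scores concrete, the short sketch below takes the rounded scores from the row above, checks that they sum to approximately 1, and picks the dominant emotion. The variable names are illustrative and not part of the project code.

```r
# Rounded scores for incident ispt0062, copied from the table above.
scores <- c(anger = 0.107, disgust = 0.051, fear = 0.037, joy = 0.003,
            neutral = 0.065, sadness = 0.714, surprise = 0.024)

# The scores form a probability distribution, so they should sum to ~1
# (small deviations come from rounding to three decimals).
score_total <- sum(scores)

# The dominant emotion is the label with the highest score.
dominant_emotion <- names(scores)[which.max(scores)]
dominant_emotion
# "sadness"
```

Because the scores sum to 1, a high score for one emotion necessarily lowers the others, which matters when comparing individual emotions over time.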
Coordinates table: When possible, Airwars includes the coordinates of where the incident took place. Although this information sits inside the assessment, Airwars standardizes its location under a “Geolocation notes” heading, from which we parsed the latitude and longitude for geographic plotting. Of the 804 incidents, only 489 had geolocation notes.
- We used the Nominatim OpenStreetMap API to reverse geocode the coordinates and return the type of location at the incident site, for incidents that contained coordinates. We save this information in the table below as type_location.
tbl(mydb, "airwars_coord") |>
  as_tibble() |>
  slice_sample(n=8) |>
  kable()

| Incident_id | lat | long | type_location |
|---|---|---|---|
| ispt0187ga | 31.43793 | 34.40386 | pharmacy |
| ispt0462 | 31.29646 | 34.24376 | secondary |
| ispt0540 | 31.54231 | 34.49516 | school |
| ispt0320 | 31.33930 | 34.30615 | school |
| ispt0159 | 31.54613 | 34.51733 | school |
| ispt0220 | 31.37043 | 34.33453 | residential |
| ispt0093 | 31.34615 | 34.30390 | secondary |
| ispt0483 | 31.55128 | 34.50923 | track |
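The reverse-geocoding step can be sketched as follows. The helper names are hypothetical, and the sketch assumes the public Nominatim `/reverse` endpoint with the `jsonv2` output format, whose response includes a `type` field; the project code may query the API differently. The request itself is wrapped in a function and not executed here, since the public service expects an identifying User-Agent and at most about one request per second.

```r
# Hypothetical helper: build a Nominatim reverse-geocoding URL.
nominatim_reverse_url <- function(lat, lon) {
  sprintf(
    "https://nominatim.openstreetmap.org/reverse?format=jsonv2&lat=%.5f&lon=%.5f",
    lat, lon
  )
}

# Hypothetical helper: fetch the OSM "type" for one coordinate pair.
# Not called in this sketch; requires the jsonlite package and network access.
reverse_geocode_type <- function(lat, lon) {
  res <- jsonlite::fromJSON(nominatim_reverse_url(lat, lon))
  Sys.sleep(1)  # stay within the public service's rate limit
  if (is.null(res$type)) NA_character_ else res$type
}

# Coordinates of incident ispt0540 from the table above.
geocode_url <- nominatim_reverse_url(31.54231, 34.49516)
geocode_url
```

Note that the returned `type` reflects whatever OSM feature is nearest to the coordinates, which may explain values such as `secondary` or `track` (road classes) in the table above.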
- Incidence table: Our final table contains the fields that Airwars populates for each incident. Besides parsing this information, we also had to process it: some fields contain ranges of people killed (e.g., 3-5) or counts by group (e.g., 1 child, 3 women, 1 man), and we split these strings into their own columns. This allows us to estimate how many children and women have been reported as civilian casualties. Our data contain 24 variables for the 804 incidents reported by Airwars.
tbl(mydb, "airwars_incidents") |>
  as_tibble() |>
  glimpse()

Rows: 804
Columns: 24
$ Incident_id <chr> "ispt280824r", "ispt240824l", "ispt220…
$ `Strike status` <chr> "Likely strike", "Single source claim"…
$ `Strike type` <chr> "Airstrike, Drone Strike", "Ground ope…
$ `Civilian infrastructure` <chr> "School", NA, "School", "Residential b…
$ `Civilian harm reported` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
$ `Civilians reported killed` <chr> "10 – 13", "Unknown", "2", "4 – 7", "1…
$ `Civilians reported injured` <chr> "7–24", "1", NA, "13–18", "12–24", "54…
$ `Cause of injury / death` <chr> "Heavy weapons and explosive munitions…
$ `Airwars civilian harm grading` <chr> "Fair", "Weak", "Fair", "Fair", "Fair"…
$ Impact <chr> "Education", NA, "Education", "Educati…
$ `Suspected belligerent` <chr> "Israeli Military", "Israeli Military"…
$ min_killed <int> 10, NA, 2, 4, 16, 77, 6, 12, NA, 26, 2…
$ max_killed <int> 13, NA, 2, 7, 19, 125, 6, 13, NA, 27, …
$ casualty_estimate <chr> "range", NA, "absolute", "range", "ran…
$ killed <int> 11, NA, 2, 5, 17, 101, 6, 12, NA, 26, …
$ `Known belligerent` <chr> NA, NA, NA, "Israeli Military", "Israe…
$ `Known target` <chr> NA, NA, NA, "Palestinian Forces", NA, …
$ `Suspected target` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Causes of injury / death` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Suspected belligerents` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ children_killed <int> NA, NA, NA, 3, 11, 11, 2, 8, NA, 18, N…
$ women_killed <int> NA, NA, NA, NA, 1, 6, NA, 2, NA, 6, NA…
$ men_killed <int> NA, NA, 1, 2, 6, 60, 2, 2, NA, 15, NA,…
$ Civilian_type <chr> NA, NA, "(1 man)", "(2–3 children2 men…
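The split of the `Civilians reported killed` strings into `min_killed`, `max_killed`, `casualty_estimate`, and `killed` can be sketched as below. The function name `parse_killed` is hypothetical, and taking the midpoint of the range rounded down is an assumption; it is consistent with the values shown in the output above (for example, "10 – 13" gives 11 and "4 – 7" gives 5), but the repository code may compute the estimate differently.

```r
# Hypothetical helper: parse one "Civilians reported killed" string.
parse_killed <- function(x) {
  empty <- data.frame(min_killed = NA_integer_, max_killed = NA_integer_,
                      casualty_estimate = NA_character_, killed = NA_integer_)
  if (is.na(x)) return(empty)
  # Extract every run of digits: "10 – 13" -> 10, 13; "Unknown" -> nothing.
  nums <- as.integer(regmatches(x, gregexpr("[0-9]+", x))[[1]])
  if (length(nums) == 0) return(empty)
  lo <- min(nums)
  hi <- max(nums)
  data.frame(
    min_killed = lo,
    max_killed = hi,
    casualty_estimate = if (lo == hi) "absolute" else "range",
    # Assumption: point estimate is the midpoint of the range, rounded down.
    killed = as.integer(floor((lo + hi) / 2))
  )
}

raw <- c("10 – 13", "Unknown", "2", "4 – 7")
parsed <- do.call(rbind, lapply(raw, parse_killed))
parsed
```

For these four inputs the sketch reproduces the first four values of `min_killed`, `max_killed`, `casualty_estimate`, and `killed` in the output above.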
External Data Sources Enrich Airwars
Palestine Dataset publishes daily Gaza casualty counts, taken from the Hamas MoH, as a downloadable .csv file; however, the counts do not distinguish civilians from militants, so their numbers should be higher than what we derive from Airwars.[4]
- We download the CSV file and derive new variables to get cumulative casualty counts for children, women, and men.
tbl(mydb, "daily_casualties") |>
  as_tibble() |>
  # men are derived as the total minus children and women
  mutate(ext_killed_men_cum = ext_killed_cum -
           (ext_killed_children_cum + ext_killed_women_cum),
         Incident_Date = lubridate::as_date(Incident_Date)) |>
  select(Incident_Date, ext_killed_men_cum,
         ext_killed_women_cum, ext_killed_children_cum) |>
  pivot_longer(-Incident_Date) |>
  # label each series, then order the groups used for the fill
  mutate(cumulative_casualties = case_when(
           name == "ext_killed_men_cum" ~ "Men",
           name == "ext_killed_women_cum" ~ "Women",
           name == "ext_killed_children_cum" ~ "Children"
         ),
         cumulative_casualties = factor(cumulative_casualties,
                                        levels = c("Children",
                                                   "Women",
                                                   "Men"))) |>
  ggplot(aes(x = Incident_Date, y = value, fill = cumulative_casualties)) +
  geom_area(position = "stack") +
  ylab("Casualties") +
  xlab("") +
  guides(fill = guide_legend(title = "Cumulative Casualties")) +
  scale_fill_viridis_d(option = "E", alpha = .75, begin = .20) +
  theme_hc()
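The arithmetic behind the derived series can be illustrated on a toy data frame. The column names mirror those used in the code above, but the numbers are made up for this example and are not MoH figures. The sketch also shows how daily counts could be recovered from the cumulative totals, which is an optional step the plot above does not need.

```r
# Made-up cumulative counts for four consecutive days (not MoH figures).
toy <- data.frame(
  Incident_Date           = as.Date("2023-10-07") + 0:3,
  ext_killed_cum          = c(100, 250, 450, 700),
  ext_killed_children_cum = c(40, 100, 180, 280),
  ext_killed_women_cum    = c(20, 50, 90, 140)
)

# Men are derived as the total minus children and women.
toy$ext_killed_men_cum <- toy$ext_killed_cum -
  (toy$ext_killed_children_cum + toy$ext_killed_women_cum)

# Daily counts are the day-to-day differences of the cumulative series.
toy$ext_killed_daily <- c(toy$ext_killed_cum[1], diff(toy$ext_killed_cum))

toy[, c("Incident_Date", "ext_killed_men_cum", "ext_killed_daily")]
```

Note that the derived men series absorbs anyone not classified as a child or a woman in the source file, so it is best read as a residual category.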
ACLED tracks conflict events around the world; here we use its records of Israel Defense Forces airstrikes and ground operations in Gaza since October 2023. The data are provided as a downloadable .csv file containing the date and type of each conflict event and the actors involved.
- We examine only events in Gaza where Israeli forces are the aggressor and ACLED flags civilian targeting. These events fall under the sub-event types attack, air/drone strike, shelling/artillery/missile attack, and remote explosive/landmine/IED, for a total of almost 6,000 incidents since October 2023.
daily_conflict <- tbl(mydb, "daily_conflict") |>
  as_tibble() |>
  # keep events where Israeli forces are the aggressor and civilians were targeted
  filter(actor1 == "Military Forces of Israel (2022-)",
         civilian_targeting == "Civilian targeting") |>
  mutate(Incident_Date = lubridate::as_date(Incident_Date))
daily_conflict |>
  group_by(sub_event_type) |>
  count() |>
  ungroup() |>
  mutate(proportion_incidents = round(n/sum(n), 2)) |>
  kable()

| sub_event_type | n | proportion_incidents |
|---|---|---|
| Air/drone strike | 4657 | 0.80 |
| Attack | 181 | 0.03 |
| Remote explosive/landmine/IED | 3 | 0.00 |
| Shelling/artillery/missile attack | 983 | 0.17 |
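One way the validation check against ACLED could work is to count events per day in each source and line the two series up by date. The sketch below does this on made-up dates with base R; the helper `count_by_day` and the column names `n_airwars` and `n_acled` are hypothetical, and the project may compare the sources at a different granularity.

```r
# Made-up incident dates for each source (one row per event).
airwars <- data.frame(Incident_Date = as.Date(c(
  "2023-10-07", "2023-10-07", "2023-10-08", "2023-10-10")))
acled <- data.frame(Incident_Date = as.Date(c(
  "2023-10-07", "2023-10-08", "2023-10-08", "2023-10-09")))

# Hypothetical helper: number of events per day, with a named count column.
count_by_day <- function(df, name) {
  out <- as.data.frame(table(Incident_Date = df$Incident_Date),
                       stringsAsFactors = FALSE)
  names(out)[2] <- name
  out$Incident_Date <- as.Date(out$Incident_Date)
  out
}

# Full outer join on date, so days present in only one source are kept.
daily <- merge(count_by_day(airwars, "n_airwars"),
               count_by_day(acled, "n_acled"),
               by = "Incident_Date", all = TRUE)
daily$n_airwars[is.na(daily$n_airwars)] <- 0L
daily$n_acled[is.na(daily$n_acled)] <- 0L

# Days with Airwars incidents but no matching ACLED event.
unmatched_days <- daily$Incident_Date[daily$n_airwars > 0 & daily$n_acled == 0]
daily
```

Days that appear in `unmatched_days` would be candidates for closer inspection, since an Airwars incident without any ACLED event on the same day could indicate a dating discrepancy or a gap in one of the sources.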
Footnotes
1. The model is trained on a balanced subset from the datasets listed above (2,811 observations per emotion, i.e., nearly 20k observations in total). 80% of this balanced subset is used for training and 20% for evaluation. The evaluation accuracy is 66% (vs. the random-chance baseline of 1/7 = 14%).
2. text is an R package for analyzing natural language with transformers from Hugging Face using natural language processing and machine learning.
3. Installing text is tricky because the right Python libraries must be installed. To run models on the GPU, we learned that version 12.1 of the NVIDIA CUDA drivers must be installed. Additionally, we could only get this to work via Anaconda within Ubuntu 24.04 installed through WSL2 on Windows 11. Ubuntu 24.10 comes with a kernel that forces CUDA 12.8 to be installed, which did not work for us in a dual-boot system.
4. Note. Confidence is low to moderate since the data come from the Hamas MoH.